1 Mg Mall, Bangalore approached us to gain better insight into their customers. For this analysis they provided a dataset that they maintain, containing demographic information and behavioral data about their customers.
The dataset contains each customer's annual income and spending score (a number in the range 1-100 that reflects the customer's spending ability). We will use clustering algorithms to segment these customers and analyse the resulting clusters to understand them better.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#dataset = pd.read_excel('Tenovia2.xlsx')
dataset = pd.read_csv('Tenovia2.csv')
dataset.head()
dataset.drop(['CustomerID'],axis = 1, inplace = True)
genders = dataset.Gender.value_counts()
sns.set_style("darkgrid")
plt.figure(figsize=(10,4))
sns.barplot(x=genders.index, y=genders.values, palette="Blues_d")
plt.show()
The bar plot shows that there are more female than male customers at the 1 Mg Mall.
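To state the gender split as a number rather than read it off the bars, the counts can be turned into percentages. This is a minimal sketch on a hypothetical five-row frame (the `sample` data is made up, not taken from the mall dataset); only the `Gender` column name comes from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the mall data; only the 'Gender' column matters here.
sample = pd.DataFrame({"Gender": ["Female", "Male", "Female", "Female", "Male"]})

# Absolute counts, the same quantity the bar plot displays.
counts = sample["Gender"].value_counts()

# Relative shares in percent, which make the gap easier to state precisely.
shares = (sample["Gender"].value_counts(normalize=True) * 100).round(1)

print(counts)
print(shares)
```

Running the same two calls on `dataset.Gender` would give the actual split for the mall's customers.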
dataset.describe()
plt.figure(figsize=(15,6))
plt.subplot(1,2,1)
sns.boxplot(y=dataset["Spending Score (1-100)"], color="red")
plt.subplot(1,2,2)
sns.boxplot(y=dataset['Annual Income\n(INR)'], color="blue")
plt.show()
From the plots and the description table above, the major customers of the 1 Mg Mall are in the 28 to 49 age group, with an income in the range 41-78 INR and a spending score in the range 34-73 (out of 100).
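Ranges like the ones quoted above are typically read from the 25% and 75% rows of `describe()`, which are also the edges of each box in a box plot. Assuming that is how they were obtained here, the same numbers can be pulled out directly with `quantile`. The values below are hypothetical and only illustrate the calculation.

```python
import pandas as pd

# Hypothetical values, not taken from the mall dataset.
sample = pd.DataFrame({
    "Age": [20, 30, 40, 50, 60],
    "Spending Score (1-100)": [10, 30, 50, 70, 90],
})

# 25th and 75th percentiles: the same '25%' and '75%' rows that describe() reports.
quartiles = sample.quantile([0.25, 0.75])

# Interquartile range, i.e. the height of the box in a box plot.
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]

print(quartiles)
print(iqr)
```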
plt.figure(figsize=(20,10))
x = dataset['Annual Income\n(INR)']
y = dataset['Age']
z = dataset['Spending Score (1-100)']
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.lineplot(x=x, y=y, color = 'green')
sns.lineplot(x=x, y=z, color = 'orange')
plt.title('Annual Income vs Age and Spending Score', fontsize = 20)
plt.show()
In this plot, the green line plots Age against Annual Income (INR), and the orange line plots Spending Score against Annual Income.
plt.figure(figsize=(20,10))
sns.boxplot(
    data=dataset,
    x='Age',
    y='Spending Score (1-100)',
    color='blue')
plt.title('Age vs Spending Score', fontsize = 20)
The box plot above shows the spending scores of customers at each age.
plt.figure(1,figsize=(15,7))
n=0
for x in ['Age','Annual Income\n(INR)','Spending Score (1-100)']:
    for y in ['Age','Annual Income\n(INR)','Spending Score (1-100)']:
        n+=1
        plt.subplot(3,3,n)
        plt.subplots_adjust(hspace=0.5,wspace=0.5)
        sns.regplot(x=x,y=y,data=dataset)
        # Use the first two words of the column name as a short y-axis label
        plt.ylabel(y.split()[0]+' '+y.split()[1] if len(y.split())>1 else y)
plt.show()
The plots above show how Age, Annual Income and Spending Score vary with one another.
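The regression plots show the direction of each relationship; a correlation matrix puts a number on its strength. The sketch below uses hypothetical values with the notebook's column names, so the printed figures say nothing about the real customers.

```python
import pandas as pd

# Hypothetical values chosen so the relationships are easy to read off.
sample = pd.DataFrame({
    "Age": [20, 30, 40, 50, 60],
    "Annual Income\n(INR)": [15, 40, 60, 80, 120],
    "Spending Score (1-100)": [90, 70, 50, 30, 10],
})

# Pearson correlation between every pair of numeric columns.
# Values near +1 or -1 indicate a strong linear relationship, values near 0 a weak one.
corr = sample.corr()

print(corr.round(2))
```

On the notebook's data, `dataset[['Age', 'Annual Income\n(INR)', 'Spending Score (1-100)']].corr()` would produce the corresponding 3x3 table.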
X = dataset.iloc[:,1:].values
#Using the elbow method to find the optimum number of clusters
from sklearn.cluster import KMeans
wcss = []
for i in range(1,11):
    km=KMeans(n_clusters=i,init='k-means++', max_iter=500, n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)
plt.figure(figsize=(8,8))
plt.plot(range(1,11),wcss,'r',marker='o', markersize=10)
plt.axvline(5, ls="--", c="b")
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Within cluster sum of square')
plt.show()
Based on the elbow plot above, 4 to 6 clusters are all reasonable choices.
Because the elbow is not sharply defined, let us first visualize the clusters for the middle value, i.e. 5.
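When the elbow is ambiguous, one common complement is the silhouette score, which is higher when points sit close to their own cluster and far from the others. This is not part of the original analysis; it is a minimal sketch on hypothetical, well-separated 2-D blobs (the `points` array is made up), showing how candidate values of k could be compared.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated 2-D blobs, not the mall dataset.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [20, 0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit K-Means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# The candidate with the highest score is the best separated under this metric.
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(k, round(score, 3))
print("best k:", best_k)
```

Replacing `points` with the notebook's `X` and restricting the loop to 4, 5 and 6 would give a second opinion alongside the elbow plot.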
X= dataset[['Age', 'Annual Income\n(INR)', 'Spending Score (1-100)']]
# initialise and fit K-Means model
KM_5_clusters = KMeans(n_clusters=5, init='k-means++').fit(X)
KM5_clustered = X.copy()
# append labels to points
KM5_clustered.loc[:,'Cluster'] = KM_5_clusters.labels_
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
scat_1 = sns.scatterplot(x='Annual Income\n(INR)', y='Spending Score (1-100)', data=KM5_clustered,
                hue='Cluster', ax=axes[0], palette='Set1', legend='full')
sns.scatterplot(x='Age', y='Spending Score (1-100)', data=KM5_clustered,
                hue='Cluster', palette='Set1', ax=axes[1], legend='full')
axes[0].scatter(KM_5_clusters.cluster_centers_[:,1],KM_5_clusters.cluster_centers_[:,2], marker='s', s=40, c="blue")
axes[1].scatter(KM_5_clusters.cluster_centers_[:,0],KM_5_clusters.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()
K-Means algorithm generated the following 5 clusters:
import plotly as py
import plotly.graph_objs as go
def tracer(db, n, name):
    # Build one 3-D scatter trace for the points assigned to cluster n
    return go.Scatter3d(
        x = db[db['Cluster']==n]['Age'],
        y = db[db['Cluster']==n]['Spending Score (1-100)'],
        z = db[db['Cluster']==n]['Annual Income\n(INR)'],
        mode = 'markers',
        name = name,
        marker = dict(
            size = 5
        )
    )
e0 = tracer(KM5_clustered, 0, 'low annual income and high spending')
e1 = tracer(KM5_clustered, 1, 'medium annual income and medium spending')
e2 = tracer(KM5_clustered, 2, 'high annual income and low spending')
e3 = tracer(KM5_clustered, 3, 'high annual income and high spending')
e4 = tracer(KM5_clustered, 4, 'low annual income and low spending')
data = [e0, e1, e2, e3, e4]
layout = go.Layout(
    title = 'Clusters by K-Means',
    scene = dict(
        xaxis = dict(title = 'Age'),
        yaxis = dict(title = 'Spending Score'),
        zaxis = dict(title = 'Annual Income')
    )
    #,width=1000,
    #height=1000
)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)
KM_clust_sizes = KM5_clustered.groupby('Cluster').size().to_frame()
KM_clust_sizes.columns = ["KM_size"]
KM_clust_sizes
The biggest cluster is cluster 2, with 77 observations (high annual income and low spending clients). The smallest is cluster 4, with 23 observations (low annual income and low spending clients).
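K-Means label numbers are arbitrary, and when no `random_state` is set they can differ from run to run, so a description tied to "cluster 2" may not hold on a re-run. One way to guard against this is to fix the seed and derive each description from the cluster's centroid. The sketch below uses hypothetical data, and the `level` helper with its "above or below the column mean" rule is an assumption made for illustration, not the notebook's method.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data: two obvious groups, not the mall dataset.
sample = pd.DataFrame({
    "Annual Income\n(INR)": [15, 16, 17, 85, 86, 87],
    "Spending Score (1-100)": [80, 82, 81, 15, 14, 16],
})
income_col, spend_col = sample.columns

# A fixed random_state makes the label assignment repeatable across runs.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sample)
centroids = pd.DataFrame(km.cluster_centers_, columns=sample.columns)


def level(value, column):
    # Hypothetical rule: compare the centroid with the overall column mean.
    return "high" if value > sample[column].mean() else "low"


# Name each cluster from its centroid instead of hard-coding the label number.
names = {}
for label, row in centroids.iterrows():
    names[label] = (f"{level(row[income_col], income_col)} annual income and "
                    f"{level(row[spend_col], spend_col)} spending")

# Cluster sizes keyed by description rather than by label number.
sizes = pd.Series(km.labels_).map(names).value_counts()
print(sizes)
```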
KM_6_clusters = KMeans(n_clusters=6, init='k-means++').fit(X)
KM6_clustered = X.copy()
KM6_clustered.loc[:,'Cluster'] = KM_6_clusters.labels_
fig2, (axes) = plt.subplots(1,2,figsize=(12,5))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.scatterplot(x='Annual Income\n(INR)', y='Spending Score (1-100)', data=KM6_clustered,
                hue='Cluster', ax=axes[0], palette='Set1', legend='full')
sns.scatterplot(x='Age', y='Spending Score (1-100)', data=KM6_clustered,
                hue='Cluster', palette='Set1', ax=axes[1], legend='full')
# plotting centroids
axes[0].scatter(KM_6_clusters.cluster_centers_[:,1], KM_6_clusters.cluster_centers_[:,2], marker='s', s=40, c="blue")
axes[1].scatter(KM_6_clusters.cluster_centers_[:,0], KM_6_clusters.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()
K-Means algorithm generated the following 6 clusters:
KM6_clust_size = KM6_clustered.groupby('Cluster').size().to_frame()
KM6_clust_size.columns = ["KM_size"]
KM6_clust_size
With 6 clusters, the biggest cluster is cluster 2, with 45 observations (customers with medium annual income and medium spending). The smallest is cluster 4, with 21 observations (customers with low annual income and low spending).
Overall, the K-Means model with 6 clusters provides the better customer analysis.
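To back a conclusion like this with numbers, each cluster can be summarised by its size and the mean of every feature. The frame below is hypothetical; it is shaped like `KM6_clustered` but the rows are made up, so the output only illustrates the layout of such a profile.

```python
import pandas as pd

# Hypothetical clustered frame, shaped like KM6_clustered but with made-up rows.
clustered = pd.DataFrame({
    "Age": [25, 27, 55, 57, 40, 42],
    "Annual Income\n(INR)": [20, 22, 60, 62, 90, 88],
    "Spending Score (1-100)": [80, 78, 45, 47, 15, 17],
    "Cluster": [0, 0, 1, 1, 2, 2],
})

# Mean of every feature per cluster: one row describes one customer segment.
profile = clustered.groupby("Cluster").mean()

# Add the number of customers in each cluster alongside the means.
profile["size"] = clustered.groupby("Cluster").size()

print(profile)
```

Building this table for both `KM5_clustered` and `KM6_clustered` would show which extra segment the sixth cluster separates out.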